{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Julia for Data Science\n",
    "Prepared by [@nassarhuda](https://github.com/nassarhuda)! 😃\n",
    "\n",
    "`Last updated on 03/Jan/2020`\n",
    "\n",
    "In this tutorial, we will discuss why *Julia* is the tool you want to use for your data science applications.\n",
    "\n",
    "We will cover the following:\n",
    "* **Data**\n",
    "* Data processing\n",
    "* Visualization\n",
    "\n",
    "### Data: Build a strong relationship with your data.\n",
    "Every data science task has one main ingredient, the _data_! Most likely, you want to use your data to learn something new. But before the _new_ part, what about the data you already have? Let's make sure you can **read** it, **store** it, and **understand** it before you start using it.\n",
    "\n",
    "Julia makes this step really easy with data structures and packages to process the data, as well as, existing functions that are readily usable on your data. \n",
    "\n",
    "The goal of this first part is get you acquainted with some Julia's tools to manage your data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's download a csv file from github that we can work with.\n",
    "\n",
    "Note: `download` depends on external tools such as curl, wget or fetch. So you must have one of these."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P = download(\"https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv\",\"programminglanguages.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use shell commands like `ls` in Julia by preceding them with a semicolon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And there's the *.csv file we downloaded!\n",
    "\n",
    "Now let's load it into Julia"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the `DelimitedFiles` standard library package and its `readdlm()` function as shown\n",
    "below.\n",
    "\n",
    "(Today, the [CSV.jl](https://juliadata.github.io/CSV.jl/stable/) package is the recommended way to load CSVs in Julia. We can install it via `Pkg.add()`, and load .csv files using `CSV.read()`. This tutorial hasn't been updated to use CSV.jl yet.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# using Pkg\n",
    "# Pkg.add(\"CSV\") # for CSV.read()\n",
    "# Pkg.add(\"DelimitedFiles\") # for readdlm\n",
    "\n",
    "# using CSV\n",
    "# P = CSV.read(\"programminglanguages.csv\",header=true)\n",
    "# or\n",
    "using DelimitedFiles  # Standard library in Base"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `readdlm` will fill an array with the data stored in the input .csv file. If we set the keyword argument `header` to `true`, we'll get a second output array for just the headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P,H = readdlm(\"programminglanguages.csv\", ',', header=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P # stores the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "H # stores the header names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we write our first small function. <br>\n",
    "Now you can answer questions such as, \"when was language X created?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "function language_created_year(P,language::String)\n",
    "    loc = findfirst(P[:,2].==language)\n",
    "    return P[loc,1]\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "language_created_year(P,\"Julia\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "language_created_year(P,\"julia\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, this will not return what you want, but thankfully, string manipulation is really easy in Julia!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "function language_created_year_v2(P,language::String)\n",
    "    loc = findfirst(lowercase.(P[:,2]).==lowercase.(language))\n",
    "    return P[loc,1]\n",
    "end\n",
    "language_created_year_v2(P,\"julia\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reading and writing to files is really easy in Julia.** <br>\n",
    "\n",
    "You can use different delimiters with the function `readdlm`, from the `DelimitedFiles` package. <br>\n",
    "\n",
    "To write to files, we can use `writedlm`. <br>\n",
    "\n",
    "Let's write this same data to a file with a different delimiter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "writedlm(\"programming_languages_data.txt\", P, '-')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now check that this worked using a shell command to glance at the file,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";head -10 programming_languages_data.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and also check that we can use `readdlm` to read our new text file correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P_new_delim = readdlm(\"programming_languages_data.txt\", '-');\n",
    "P == P_new_delim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dictionaries\n",
    "Let's try to store the above data in a dictionary format!\n",
    "\n",
    "First, let's initialize an empty dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict = Dict{Integer,Vector{String}}()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we told Julia that we want `dict` to only accept integers as keys and vectors of strings as values.\n",
    "\n",
    "However, we could have initialized an empty dictionary without providing this information (depending on our application)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict2 = Dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dictionary takes keys and values of any type!\n",
    "\n",
    "Now, let's populate the dictionary with years as keys and vectors that hold all the programming languages created in each year as their values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i = 1:size(P,1)\n",
    "    year,lang = P[i,:]\n",
    "    \n",
    "    if year in keys(dict)\n",
    "        dict[year] = push!(dict[year],lang)\n",
    "    else\n",
    "        dict[year] = [lang]\n",
    "    end\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can pick whichever year you want and find what programming languages were invented in that year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict[2003]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataFrames! \n",
    "*Shout out to R fans!*\n",
    "One other way to play around with data in Julia is to use a DataFrame.\n",
    "\n",
    "This requires loading the `DataFrames` package. Thankfully, this tutorial came with a Project.toml file that specifies exactly which version of DataFrames to install..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Project.toml files\n",
    "\n",
    "For this tutorial (Julia for Data Science), you may have noticed that there are files in this folder called [`Project.toml`](/edit/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-for-data-science/Project.toml) and [`Manifest.toml`](/edit/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-for-data-science/Manifest.toml). These are files autogenerated by Julia's package manager, `Pkg`, that describe the _exact set of packages_ installed for a julia project, allowing you to share your work in a perfectly reproducible way.\n",
    "\n",
    "Jupyter was able to detect those `.toml` files, and so this notebook was automatically started with _this project activated!_ Note that this means any packages you add or remove inside this notebook will only affect this \"Julia for Data Science\" _project_.\n",
    "\n",
    "To install all of the package dependencies used in the rest of the tutorial, you only need to run this next cell (commands that start with `]` are package repl commands):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "] instantiate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can read more about package manager commands, here: https://docs.julialang.org/en/v1/stdlib/Pkg/index.html\n",
    "\n",
    "**Now back to DataFrames!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames\n",
    "df = DataFrame(year = P[:,1], language = P[:,2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can access columns by header name, or column index.\n",
    "\n",
    "In this case, `df[1]` is equivalent to `df.year` or `df[!, :year]`.\n",
    "\n",
    "Note that if we want to index columns by header name, we precede the header name with a colon. In Julia, this means that the header names are treated as *symbols*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**`DataFrames` provides some handy features when dealing with data**\n",
    "\n",
    "First, it uses julia's \"missing\" type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = missing\n",
    "typeof(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what happens when we try to add a \"missing\" type to a number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrames` provides the `describe` function, which can give you quick statistics about each column in your dataframe "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "describe(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RDatasets\n",
    "\n",
    "We can use RDatasets to play around with pre-existing datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using RDatasets\n",
    "iris = dataset(\"datasets\", \"iris\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that data loaded with `dataset` is stored as a DataFrame. 😃"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "typeof(iris) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary we get from `describe` on `iris` gives us a lot more information than the summary on `df`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "describe(iris)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More on Missing Values\n",
    "\n",
    "Julia 1.0 and beyond has native support for `missing` values. (Before Julia 1.0, this was done via the DataArrays.jl package.)\n",
    "More information on using arrays with missing values can be found\n",
    "[in the Julia documentation](https://docs.julialang.org/en/v1/manual/missing/#Arrays-With-Missing-Values-1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "foods = [\"apple\", \"cucumber\", \"tomato\", \"banana\"]\n",
    "calories = [missing,47,22,105]\n",
    "typeof(calories)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Statistics  # julia's standard library for stats\n",
    "mean(calories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing values ruin everything! 😑\n",
    "\n",
    "Luckily we can ignore them with `skipmissing`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean(skipmissing(calories))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oh WAIT! Detour. How did I get the emoji there?\n",
    "\n",
    "Try this out:\n",
    "\n",
    "```\n",
    "\\:expressionless: + <TAB>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "😑 = 0 # expressionless\n",
    "😀 = 1\n",
    "😞 = -1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Back to missing values*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prices = [0.85,1.6,0.8,0.6,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframe_calories = DataFrame(item=foods,calories=calories)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframe_prices = DataFrame(item=foods,price=prices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also `join` two dataframes together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DF = join(dataframe_calories,dataframe_prices,on=:item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### FileIO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using FileIO\n",
    "julialogo = download(\"https://avatars0.githubusercontent.com/u/743164?s=200&v=4\",\"julialogo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, let's check that this download worked!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ";ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Next, let's load the Julia logo, stored as a .png file\n",
    "\n",
    "**NOTE: You may see errors below, that certain Image packages could not be found. If so:**\n",
    " - This is because these packages are specific to your OS, so aren't installed by default.\n",
    " - Simply run the suggested commands to install the packages, then try again. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These commands may vary depending on your operating system.\n",
    "#import Pkg; Pkg.add(\"QuartzImageIO\")\n",
    "#import Pkg; Pkg.add(\"ImageMagick\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X1 = load(\"julialogo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see here that Julia stores this logo as an array of colors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@show typeof(X1);\n",
    "@show size(X1);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And if we load the Images package, it will display in Jupyter as an image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File types\n",
    "In Julia, many file types are supported so you do not have to transfer a file you from another language to a text file before you read it.\n",
    "\n",
    "*Some packages that achieve this:*\n",
    "MAT CSV NPZ JLD FASTAIO\n",
    "\n",
    "\n",
    "Let's try using MAT to write a file that stores a matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using MAT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A = rand(5,5)\n",
    "matfile = matopen(\"densematrix.mat\", \"w\") \n",
    "write(matfile, \"A\", A)\n",
    "close(matfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try opening densematrix.mat with MATLAB!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "newfile = matopen(\"densematrix.mat\")\n",
    "read(newfile,\"A\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names(newfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "close(newfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 1.0.5",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
